Distances between Distributions: Comparing Language Models

Authors

  • Thierry Murgue
  • Colin de la Higuera
Abstract

Language models are used in a variety of fields in order to support other tasks: classification, next-symbol prediction, pattern analysis. In order to compare language models, or to measure the quality of an acquired model with respect to an empirical distribution, or to evaluate the progress of a learning process, we propose to use distances based on the L2 norm, or quadratic distances. We prove that these distances can not only be estimated through sampling, but can be effectively computed when both distributions are represented by stochastic deterministic finite automata. We provide a set of experiments showing a fast convergence of the distance through sampling and a good scalability, enabling us to use this distance to decide if two distributions are equal when only samples are provided, or to classify texts.
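As a rough illustration of the sampling idea in the abstract, the sketch below estimates the L2 (quadratic) distance between two distributions over strings from finite samples, by comparing relative frequencies. It is a plug-in estimate written for this page, not the authors' implementation; the function names `empirical_distribution` and `l2_distance` are hypothetical, and the exact computation on stochastic deterministic finite automata described in the paper is not shown.

```python
from collections import Counter
from math import sqrt


def empirical_distribution(sample):
    """Map each distinct string in the sample to its relative frequency."""
    counts = Counter(sample)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def l2_distance(p, q):
    """L2 distance between two distributions given as dicts string -> probability.

    Strings missing from one dict are treated as having probability 0 there.
    """
    support = set(p) | set(q)
    return sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in support))


# Usage: two small samples of strings (illustrative data only).
sample_a = ["ab", "ab", "b", "abb"]
sample_b = ["ab", "b", "b", "b"]
d = l2_distance(empirical_distribution(sample_a),
                empirical_distribution(sample_b))
print(f"estimated L2 distance: {d:.4f}")
```

With larger samples the estimate should approach the distance between the underlying distributions, which is the convergence behaviour the abstract reports on.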

Similar resources

On the Blocks of Interpoint Distances

We study the blocks of interpoint distances, their distributions, correlations, independence and the homogeneity of their total variances. We discuss the exact and asymptotic distribution of the interpoint distances and their average under three models and provide connections between the correlation of interpoint distances with their vector correlation and test of sphericity. We discuss testing...

Wasserstein distances for discrete measures and convergence in nonparametric mixture models

We consider Wasserstein distance functionals for comparing between and assessing the convergence of latent discrete measures, which serve as mixing distributions in hierarchical and nonparametric mixture models. We explore the space of discrete probability measures metrized by Wasserstein distances, clarify the relationships between Wasserstein distances of mixing distributions and f -divergenc...

On the Computation of Distances for Probabilistic Context-Free Grammars

Probabilistic context-free grammars (PCFGs) are used to define distributions over strings, and are powerful modelling tools in a number of areas, including natural language processing, software engineering, model checking, bio-informatics, and pattern recognition. A common important question is that of comparing the distributions generated or modelled by these grammars: this is done through che...

Using Graph and Vertex Entropy to Measure Similarity of Empirical Graphs with Theoretical Graph Models

Over the years, several theoretical graph generation models have been proposed. Among the most prominent are: Erdős-Rényi random graph model, Watts-Strogatz small world model, Albert-Barabási preferential attachment model, Price citation model, and many more. Often, researchers working on an empirical graph want to know which of the theoretical graph generation models is the closest, i.e., whi...

An exact distribution-free test comparing two multivariate distributions based on adjacency

A new test is proposed comparing two multivariate distributions by using distances between observations. Unlike earlier tests using interpoint distances, the new test statistic has a known exact distribution and is exactly distribution free. The interpoint distances are used to construct an optimal non-bipartite matching, i.e. a matching of the observations into disjoint pairs to minimize the t...

Publication date: 2004